Interpretable Machine Learning-Based Disease Prediction Using Clinical and Lifestyle Indicators

Authors: Aakash Tomar, Ajay Singh Tomar, Mr. Mukesh Raj

DOI Link: https://doi.org/10.22214/ijraset.2026.83282

Abstract

This paper proposes an interpretable machine learning (ML) framework for disease-risk prediction, evaluated on a synthetic test bed of 1,500 structured records. The data combine demographic, clinical, lifestyle, and symptom variables with derived risk indicators; they contain no patient identifiers, and no deployment claims are made. Five classifiers, Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), Gradient Boosting (GB), and an XGBoost-style ensemble learner (EL), were evaluated on fixed training, validation, and test splits. In the multiclass task, Gradient Boosting performed best, with an accuracy of 0.8178, a macro F1 score of 0.7652, and a ROC-AUC of 0.9542. In binary screening, Random Forest performed best, with an accuracy of 0.8978 and a ROC-AUC of 0.9632. The confusion-matrix, ROC, and feature-importance analyses indicate coherent discrimination and expose class-specific error sensitivity. The work offers a reproducible baseline for model development, not a clinical diagnostic claim.

Introduction

This study presents a machine learning-based disease-risk prediction framework designed for classification rather than clinical diagnosis. It uses a synthetic dataset of 1,500 records containing demographic, clinical, laboratory, symptom, lifestyle, and family-history variables, with no real patient data involved. The objective is to estimate disease risk and support research on screening and triage while maintaining transparency, reproducibility, and methodological rigor.

The prediction task includes both binary classification (disease risk vs. no risk) and five-class classification (cardiovascular risk, diabetes risk, kidney risk, multiple-risk cluster, and no apparent disease risk). The study emphasizes clear reporting of data definitions, validation procedures, bias assessment, and intended use, reflecting current recommendations for machine learning in healthcare.
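The relationship between the two tasks can be illustrated with a small sketch. It assumes, which the text does not state, that the binary label is obtained by collapsing the four risk classes into a single "risk" class. The label strings are hypothetical names chosen for illustration.

```python
# Hypothetical label names for the five-class task described above.
FIVE_CLASSES = (
    "cardiovascular_risk",
    "diabetes_risk",
    "kidney_risk",
    "multiple_risk_cluster",
    "no_apparent_risk",
)


def to_binary(label: str) -> int:
    """Collapse a five-class label to 1 (any disease risk) or 0 (no apparent risk).

    Assumption: the binary task is a coarsening of the five-class task.
    """
    if label not in FIVE_CLASSES:
        raise ValueError(f"unknown label: {label!r}")
    return 0 if label == "no_apparent_risk" else 1


# One binary label per five-class label, in the order listed above.
binary_labels = [to_binary(label) for label in FIVE_CLASSES]
```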

To address common limitations in clinical prediction research, the authors develop a standardized pipeline involving data preprocessing, feature normalization, missing-value imputation, leakage prevention, and fixed training-validation-test splits (1,050/225/225 records). Several supervised learning models are compared under identical experimental conditions, including Logistic Regression, Support Vector Machine (SVM), Random Forest, Gradient Boosting, and XGBoost-style ensemble learning.
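The fixed split and the leakage-prevention step can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the stratification by class, the seed, median imputation, z-score normalization, and the class counts in the usage lines are all assumptions. The point it shows is that imputation and scaling statistics are learned from the training rows only and then reused unchanged on the validation and test rows.

```python
import random
from collections import defaultdict
from statistics import mean, median, pstdev


def stratified_split(labels, sizes=(1050, 225, 225), seed=42):
    """Split record indices into train/validation/test, class by class.

    Per-class counts are rounded, so the totals match `sizes` exactly only
    when every class size divides evenly by the split fractions.
    """
    rng = random.Random(seed)          # fixed seed -> the same split every run
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = sum(sizes)
    train, val, test = [], [], []
    for label in sorted(by_class):     # sorted for a deterministic class order
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * sizes[0] / total)
        n_val = round(len(idxs) * sizes[1] / total)
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test


def fit_preprocessor(train_rows):
    """Learn per-column median, mean and std from TRAINING rows only.

    Missing values are represented as None and filled with the column median.
    """
    stats = []
    for col in zip(*train_rows):
        med = median(v for v in col if v is not None)
        filled = [med if v is None else v for v in col]
        stats.append((med, mean(filled), pstdev(filled) or 1.0))
    return stats


def transform(rows, stats):
    """Apply the training-set statistics to any split without refitting."""
    return [
        [((med if v is None else v) - mu) / sd
         for v, (med, mu, sd) in zip(row, stats)]
        for row in rows
    ]


# Made-up class counts for illustration only.
labels = (["cardiovascular"] * 600 + ["none"] * 300 + ["diabetes"] * 200
          + ["kidney"] * 200 + ["multiple"] * 200)
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fitting the statistics on the full dataset instead would let information from the validation and test records influence the training inputs, which is one common form of leakage.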

The literature review highlights major challenges in healthcare machine learning, such as inadequate validation, poor reporting practices, inconsistent preprocessing methods, risk of bias, and limited reproducibility. The study seeks to fill these gaps by providing a controlled, auditable testbed with a common dataset, feature set, and evaluation protocol.

Model performance is assessed using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC, while interpretability is supported through feature-importance analysis. Key predictive variables include HbA1c, fasting glucose, blood pressure, LDL cholesterol, creatinine, GFR, BMI, smoking exposure, physical activity, and family history. The study stresses that feature importance reflects model behavior rather than medical causality.
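For readers unfamiliar with these metrics, the following sketch computes accuracy, macro-averaged F1, and binary ROC-AUC from scratch. It is a minimal illustration with toy inputs, not the evaluation code used in the study; in particular, the averaging scheme behind the multiclass ROC-AUC is not stated in the text, so only the binary case is shown.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so small classes count as much as large ones."""
    f1_scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def binary_roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced example: always predicting the majority class "a".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

In the toy example accuracy is 0.8 while macro F1 is about 0.44, because the minority class is never recovered. The same mechanism explains why a macro F1 score can sit below accuracy when some classes are smaller or harder to separate.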

Overall, the research demonstrates a transparent and controlled framework for disease-risk prediction using machine learning. However, it explicitly avoids claims of clinical deployment or effectiveness, emphasizing that external validation, calibration assessment, and testing on real patient populations are required before any clinical application can be considered.

Conclusion

An interpretable ML pipeline for disease-risk prediction was built and tested on a simple synthetic test bed of 1,500 structured records. Because the dataset was created for algorithmic purposes, the pipeline was not applied in a clinical context and no clinical claims are made. Within that scope, the methodology is defensible: leakage-prone fields were excluded, the training, validation, and test splits were held fixed, and all classifiers were compared on the same inputs. Gradient Boosting achieved the highest multiclass test accuracy, macro F1 score, and ROC-AUC, at 0.8178, 0.7652, and 0.9542 respectively. In binary risk screening, Random Forest (RF) performed best, with an accuracy of 0.8978, a macro-averaged F1 score of 0.8738, and a ROC-AUC of 0.9632. The confusion matrix shows better recovery of the cardiovascular and no-risk classes; the diabetes and multiple-risk-cluster classes contain fewer records and overlap clinically with each other. Feature-importance analysis identified HbA1c, fasting glucose, systolic blood pressure, eGFR, creatinine, LDL cholesterol, BMI, smoking exposure, and physical activity as the most influential inputs to the model. These results suggest that structured clinical and lifestyle indicators carry usable signal for controlled disease-risk modelling; they should not be read as diagnoses.

Future work should evaluate the pipeline on externally acquired clinical data and should check calibration drift, test fairness across subgroups, and tune decision thresholds for the screening scenario. Feature relevance should also be evaluated post hoc, with and without clinician input. Any prospective phase would additionally require privacy review, institutional approvals, monitoring for distribution shift, and rules for integrating clinician oversight.

The immediate value of the study lies in its reproducibility, explicit metrics, leakage control, and restrained interpretation. That restraint is not a weakness; it is the minimum standard for credible medical machine learning work. The next version should also include calibration plots, decision-curve analysis, temporal holdout testing, and a small clinician-facing interface for reviewing and understanding predictions under practical and audit conditions.
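The calibration and decision-curve checks proposed for the next version can be sketched as follows. This is an illustrative outline with toy numbers, not part of the reported study: the bin count and the example probabilities are assumptions, and the net-benefit expression is the standard decision-curve formula, TP/N minus FP/N weighted by threshold / (1 - threshold).

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def reliability_bins(y_true, probs, n_bins=5):
    """Rows of (mean predicted probability, observed event rate, count) per non-empty bin.

    Plotting the first value against the second gives a calibration plot.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        i = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[i].append((p, y))
    return [
        (sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b), len(b))
        for b in bins if b
    ]


def net_benefit(y_true, probs, threshold):
    """Decision-curve net benefit at one risk threshold (0 < threshold < 1)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


# Toy binary outcomes and predicted risk probabilities, for illustration only.
y_true = [0, 1, 1, 0]
probs = [0.1, 0.9, 0.8, 0.3]
print(brier_score(y_true, probs))
print(reliability_bins(y_true, probs))
print(net_benefit(y_true, probs, threshold=0.5))
```

Evaluating `net_benefit` over a range of thresholds and comparing it with the "treat all" and "treat none" strategies yields a decision curve, which also informs the threshold tuning mentioned above.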

Copyright

Copyright © 2026 Aakash Tomar, Ajay Singh Tomar, Mr. Mukesh Raj. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET83282

Publish Date : 2026-05-29

ISSN : 2321-9653

Publisher Name : IJRASET
